To achieve precise semantic correlation between image and text, an image-text retrieval method based on Feature Enhancement and Semantic Correlation Matching (FESCM) was proposed. Firstly, in the feature enhancement representation module, the multi-head self-attention mechanism was introduced to enhance image region features and text word features, reducing the interference of redundant information with the alignment of image regions and text words. Secondly, the semantic correlation matching module was used not only to capture the correspondence between locally salient objects by local matching, but also to incorporate image background information into the global image features and achieve accurate global semantic correlation by global matching. Finally, the local matching scores and global matching scores were combined to obtain the final image-text matching scores. Experimental results show that the FESCM-based image-text retrieval method improves the recall sum over the extended visual semantic embedding method by 5.7 and 7.5 percentage points on the Flickr8k and Flickr30k benchmark datasets, respectively, and improves the recall sum by 3.7 percentage points over the Two-Stream Hierarchical Similarity Reasoning method on the MS-COCO dataset. The proposed method can effectively improve the accuracy of image-text retrieval and establish the semantic connection between image and text.
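The abstract does not give the implementation details of the feature enhancement module; as a rough illustration only, the following minimal sketch shows multi-head self-attention over a set of region or word feature vectors, with identity Q/K/V projections and a residual connection assumed for brevity (both are hypothetical simplifications, not the paper's design):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(x):
    # x: list of d-dim feature vectors; identity Q/K/V projections for brevity
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        out.append([sum(wi * v[t] for wi, v in zip(w, x)) for t in range(d)])
    return out

def multi_head_enhance(x, num_heads):
    # split the feature dimension into heads, attend within each head,
    # concatenate the head outputs, and add a residual connection so the
    # original region/word features are "enhanced" rather than replaced
    d = len(x[0])
    assert d % num_heads == 0, "feature dim must divide evenly into heads"
    hd = d // num_heads
    heads = [self_attention([xi[h * hd:(h + 1) * hd] for xi in x])
             for h in range(num_heads)]
    enhanced = []
    for i in range(len(x)):
        concat = [v for h in range(num_heads) for v in heads[h][i]]
        enhanced.append([a + b for a, b in zip(x[i], concat)])
    return enhanced
```

Each attention output is a convex combination of the input features, so redundant regions contribute less to a feature the lower their attention weight.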
To address the problems that the upsampling process of U-Net tends to lose details and that pathological image datasets of stomach cancer are generally small, which easily leads to over-fitting, an automatic segmentation model for stomach cancer pathological images based on improved U-Net, namely EOU-Net, was proposed. In EOU-Net, EfficientNetV2 was used as the backbone of the existing U-Net model, thereby enhancing the feature extraction ability of the network encoder. In the decoding stage, the relations between cell pixels were explored on the basis of Object-Contextual Representation (OCR), and an improved OCR module was used to alleviate the loss of detail in the upsampled images. Then, Test Time Augmentation (TTA) post-processing was applied: the images obtained by flipping and rotating the input image at different angles were predicted separately, and the prediction results were combined by feature fusion to further optimize the network output, thereby effectively mitigating the problem of small medical datasets. Experimental results on the SEED, BOT and PASCAL VOC 2012 datasets show that the Mean Intersection over Union (MIoU) of EOU-Net is improved by 1.8, 0.6 and 4.5 percentage points respectively compared with that of OCRNet. Therefore, EOU-Net can obtain more accurate segmentation results for stomach cancer images.
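The TTA step described above follows a generic transform-predict-invert-average pattern; the following minimal sketch (the exact set of flips/rotations and the fusion rule used in the paper are not specified, so this set and a plain average are assumptions) illustrates it on a 2D prediction map:

```python
def rot90(img, k=1):
    # rotate a square 2D list clockwise k times
    for _ in range(k % 4):
        img = [list(row) for row in zip(*img[::-1])]
    return img

def hflip(img):
    # horizontal flip (self-inverse)
    return [row[::-1] for row in img]

def tta_predict(img, predict):
    # each entry pairs a forward transform with its inverse; the model's
    # prediction on each transformed copy is mapped back and averaged
    variants = [
        (lambda x: x, lambda y: y),
        (hflip, hflip),
        (lambda x: rot90(x, 1), lambda y: rot90(y, 3)),
        (lambda x: rot90(x, 2), lambda y: rot90(y, 2)),
        (lambda x: rot90(x, 3), lambda y: rot90(y, 1)),
    ]
    h, w = len(img), len(img[0])
    acc = [[0.0] * w for _ in range(h)]
    for fwd, inv in variants:
        pred = inv(predict(fwd(img)))
        for i in range(h):
            for j in range(w):
                acc[i][j] += pred[i][j] / len(variants)
    return acc
```

`predict` stands in for the segmentation network; averaging the inverse-transformed predictions smooths out orientation-dependent errors without needing more training data.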
6 Degree of Freedom (6DoF) pose estimation, which estimates the 6DoF pose of an object from a given input image, that is, 3DoF translation and 3DoF rotation, is a key technology in computer vision and robotics, and has become a crucial task in fields such as robot manipulation, autonomous driving and augmented reality. Firstly, the concept of the 6DoF pose was introduced, along with the problems of traditional methods based on feature point correspondence, template matching, and three-dimensional feature descriptors. Then, the current mainstream deep-learning-based 6DoF pose estimation algorithms were introduced in detail from several perspectives: methods based on feature correspondence, pixel voting and regression, and methods oriented to multi-object instances, synthetic data and category level. At the same time, the datasets and evaluation metrics commonly used in pose estimation were summarized, and some algorithms were evaluated experimentally to show their performance. Finally, the challenges and key future research directions of pose estimation were given.
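To make the "3DoF translation plus 3DoF rotation" decomposition concrete, a minimal illustrative sketch (not tied to any surveyed algorithm) uses Rodrigues' formula to turn a 3DoF axis-angle rotation into a matrix, then combines it with a 3DoF translation to map an object-frame point into the camera frame:

```python
import math

def axis_angle_to_matrix(axis, theta):
    # Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2,
    # where K is the skew-symmetric matrix of the unit axis
    x, y, z = axis
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n
    c, s = math.cos(theta), math.sin(theta)
    C = 1.0 - c
    return [
        [c + x * x * C,     x * y * C - z * s, x * z * C + y * s],
        [y * x * C + z * s, c + y * y * C,     y * z * C - x * s],
        [z * x * C - y * s, z * y * C + x * s, c + z * z * C],
    ]

def apply_pose(R, t, p):
    # p_cam = R @ p_obj + t: the full 6DoF pose is (R, t)
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
```

For example, a 90-degree rotation about the z-axis sends the object-frame point (1, 0, 0) to (0, 1, 0) before the translation is added.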
To address the problems of low accuracy, difficult deployment and high calibration cost of visual manipulators in complex system environments, a robust joint modelling and optimization method for visual manipulators was proposed. Firstly, the subsystem models of the visual manipulator were integrated, and sample data such as servo motor rotation angles and manipulator end-effector coordinates were collected randomly in the workspace of the manipulator. Then, an Adaptive Multiple-Elites-guided Composite Differential Evolution algorithm with shift mechanism and Layered Optimization mechanism (AMECoDEs-LO) was proposed, and simultaneous optimization of the joint system parameters was completed by parameter identification. Principal Component Analysis (PCA) was performed by AMECoDEs-LO on stage data in the population, and the idea of parameter dimensionality reduction was used to implicitly guide convergence accuracy and speed. Experimental results show that with AMECoDEs-LO and the joint system model working together, the visual manipulator requires no additional instruments during calibration, achieving fast deployment and a 60% improvement in average accuracy compared to the conventional method. In the cases of broken manipulator linkages, reduced servo motor accuracy and increased camera positioning noise, the system still maintains high accuracy, which verifies the robustness of the proposed method.
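AMECoDEs-LO itself is not specified in the abstract; as a hedged illustration of the underlying idea of parameter identification by differential evolution, the sketch below is a classic DE/rand/1/bin loop minimizing a fitness function over bounded parameters (population size, F, CR and generation count are illustrative defaults; the elite-guidance, shift, PCA and layered-optimization mechanisms are omitted):

```python
import random

def differential_evolution(fitness, bounds, pop_size=20, F=0.5, CR=0.9,
                           gens=150, seed=0):
    # fitness: parameter vector -> error to minimize (e.g. end-effector
    # position residual in a joint system model); bounds: (lo, hi) per dim
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [fitness(ind) for ind in pop]
    for _ in range(gens):
        for i in range(pop_size):
            # mutate with three distinct random individuals (DE/rand/1)
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            jr = rng.randrange(dim)  # forces at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == jr:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    trial.append(min(max(v, lo), hi))  # clamp to bounds
                else:
                    trial.append(pop[i][j])
            tc = fitness(trial)
            if tc <= cost[i]:  # greedy selection
                pop[i], cost[i] = trial, tc
    best = min(range(pop_size), key=lambda k: cost[k])
    return pop[best], cost[best]
```

In a calibration setting, `fitness` would compare the joint model's predicted end-effector coordinates against the randomly collected samples, so no external measurement instruments are needed beyond the camera itself.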